Refactor: standard install/start/check/stop/load/query interface per system #860

Open
alexey-milovidov wants to merge 109 commits into main from refactor/per-system-script-interface

@alexey-milovidov
Member

Summary

  • Split each local system's monolithic benchmark.sh into 7 single-purpose scripts (install, start, check, stop, load, query, data-size) with a stable contract, driven by a new shared lib/benchmark-common.sh.
  • Wrap dataframe / in-process systems (pandas, polars-dataframe, chdb-dataframe, daft-parquet*, duckdb-dataframe, sirius) in small FastAPI servers so they fit the same start/stop/query lifecycle.
  • 88 local systems refactored; cloud/managed systems and a handful of non-functional ones are intentionally untouched.

Why

Previously, every system's benchmark.sh bundled installation, server lifecycle, dataset download, data loading, and query dispatch into one script — and run.sh hard-coded the per-query orchestration. There was no programmatic per-query entry point, so:

  1. Tweaking the dataset, query set, or per-query behavior (e.g. restarting the system between queries to neutralize warm-process effects) required editing every system's scripts individually.
  2. Building an online "run query X against system Y" service was impossible.
  3. Most run.sh scripts ran all 3 tries inside a single CLI invocation, so OS-cache warmth from try 1 leaked into tries 2/3.

The new per-system interface

| Script | Stdin | Stdout | Stderr | Notes |
|---|---|---|---|---|
| install | - | progress | progress | Idempotent. Env prep + system install. |
| start | - | - | progress | Start daemon. Idempotent. Empty/exit-0 for stateless tools. |
| check | - | - | progress | Trivial query (e.g. SELECT 1). Exit 0 iff responsive. |
| stop | - | - | progress | Stop daemon. Idempotent. |
| load | - | progress | progress | Runs create.sql + loads data; deletes source files then sync. |
| query | one query | query result, any format | last line: fractional seconds (0.123) | Non-zero exit on failure. |
| data-size | - | bytes (one integer) | - | Reports the data footprint. |
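
As a rough illustration of the query row in the contract above, a per-system query script could be shaped like the sketch below. It is not taken from the PR: run_engine is a hypothetical placeholder for the real client invocation, and date +%s.%N assumes GNU date.

```shell
#!/bin/bash
# Sketch of a ./query-style wrapper (illustrative only).
# run_engine is a hypothetical placeholder for the real client; it reads the
# query from stdin and prints the result to stdout.
run_engine() {
    cat >/dev/null      # consume the query
    echo "42"           # pretend result
}

query_with_timing() {
    local start end
    start=$(date +%s.%N)            # assumes GNU date (%N = nanoseconds)
    run_engine || return 1          # non-zero exit on failure, no timing line
    end=$(date +%s.%N)
    # Contract: fractional seconds as the last line of stderr.
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }' >&2
}
```

Usage would be along the lines of `echo "SELECT 1" | query_with_timing`, with the result arriving on stdout and the timing on stderr.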

Each system's benchmark.sh becomes a 4-line shim that sets a couple of env vars and exec's the shared driver:

#!/bin/bash
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_RESTARTABLE=yes
exec ../lib/benchmark-common.sh

The shared driver runs install → start+check → download → load (timed) → for each query: flush caches; if BENCH_RESTARTABLE=yes, stop+start; run query 3× → data-size → stop. The output log shape (Load time:, one [t1,t2,t3] line per query, Data size:) is identical to the old benchmark.sh, so cloud-init.sh.in's POST to play.clickhouse.com keeps working unchanged.
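
The "run query 3×" step of that flow could be sketched as follows. This is a simplified illustration, not the code in lib/benchmark-common.sh: QUERY_CMD is a hypothetical stand-in for the per-system ./query script, and cache flushing and restarts are left out.

```shell
# Sketch: run one query three times and print a [t1,t2,t3] line.
# QUERY_CMD is a hypothetical stand-in for the per-system ./query script.
bench_time_query() {
    local query="$1" times="" t err i
    err=$(mktemp)
    for i in 1 2 3; do
        t=null
        # stdout (the result) is discarded; stderr carries the timing.
        if printf '%s\n' "$query" | "$QUERY_CMD" >/dev/null 2>"$err"; then
            t=$(tail -n1 "$err")
            # Anything that is not a plain number is recorded as null.
            printf '%s\n' "$t" | grep -Eq '^[0-9]+(\.[0-9]+)?$' || t=null
        fi
        times="${times:+$times,}$t"
    done
    rm -f "$err"
    echo "[$times]"
}
```

A failed try (non-zero exit or a missing timing line) shows up as null in the array, which is how the pandas Q43 failure mentioned later in this description would be recorded.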

BENCH_RESTARTABLE=no is set for embedded CLIs (duckdb, sqlite, datafusion, …) and dataframe wrappers, because restarting a single CLI/Python process between queries would dominate query time. For these, OS caches are still flushed between queries.
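
A minimal sketch of how the per-query preparation might branch on BENCH_RESTARTABLE is shown below. STOP_CMD, START_CMD and FLUSH_CMD are hypothetical hooks, the default of "no" is an assumption, and flushing while the server is stopped is one plausible ordering rather than a statement about the shared driver.

```shell
# Sketch of a per-query preparation step (illustrative only).
# STOP_CMD, START_CMD and FLUSH_CMD are hypothetical hooks; in a real driver
# they would be ./stop, ./start and an OS page-cache flush, which on Linux is
# typically `sync; echo 3 > /proc/sys/vm/drop_caches` and needs root.
bench_prepare_query() {
    if [ "${BENCH_RESTARTABLE:-no}" = "yes" ]; then
        "$STOP_CMD" || true     # stop is idempotent, so failures are tolerated
        "$FLUSH_CMD"
        "$START_CMD" || true    # readiness would be verified separately by check
    else
        "$FLUSH_CMD"            # embedded CLIs and wrappers: flush caches only
    fi
}
```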

Scope

Refactored (88 systems):

  • Server, restartable: clickhouse, postgresql, mysql, mariadb, monetdb, druid, pinot, vertica, exasol, kinetica, heavyai, questdb, cockroachdb, elasticsearch, ydb, … and the postgres/clickhouse/mysql variants (timescaledb, citus, paradedb, postgresql-indexed, clickhouse-parquet*, clickhouse-datalake*, mysql-myisam, tidb, infobright, …)
  • Embedded CLI, not restartable: duckdb (and variants), sqlite, datafusion (and partitioned), glaredb (and partitioned), hyper, hyper-parquet, octosql, opteryx, sail (and partitioned), drill, turso, chdb, chdb-parquet-partitioned
  • Dataframe with FastAPI wrapper, not restartable: pandas, polars-dataframe, chdb-dataframe, daft-parquet, daft-parquet-partitioned, duckdb-dataframe, sirius
  • Spark family: spark, spark-auron, spark-comet, spark-gluten

Not refactored (intentionally out of scope):

  • Cloud / managed: alloydb, athena, aurora-{mysql,postgresql}, bigquery, clickhouse-cloud, databricks, motherduck, redshift, redshift-serverless, snowflake, hydrolix, firebolt(), hologres, tinybird, hydra, mariadb-columnstore, pg_duckdb, singlestore, supabase, tablespace, tembo-olap, timescale-cloud, crunchy-bridge-for-analytics, s3select, …
  • Non-functional: csvq, dsq, locustdb (panic on first query); exasol, spark-velox (empty dirs)
  • Non-SQL or no SQL CLI: mongodb (JS aggregation pipelines), polars (no SQL CLI; the dataframe variant is wrapped instead)

Validated end-to-end on a 96-core / 185 GB ARM machine

| System | Data | Outcome |
|---|---|---|
| clickhouse | 14.2 GB / 100M rows | Full 43 queries × 3 tries with stop/start between queries; load 124s |
| duckdb | 20.6 GB / 100M rows | Full 43 queries × 3 tries (no restart); load 69s |
| pandas | 4.2 GB in-mem (5M-row subset) | 42/43 queries; Q43 hit a pandas lambda bug → recorded as null (framework's error path works) |
| sqlite | 3.9 GB (5M-row subset) | First 5 queries × 3 tries; load 68s |
| postgresql | 100M rows / 75 GB TSV | First 3 queries × 3 tries with restart; load 829s. Cold-cache spike clearly visible (135s → 7s after warmup); confirms per-query restart actually flushes the page cache |

All 88 refactored systems pass bash -n and have executable bits set on the 7 scripts + benchmark.sh.

Bug fixes surfaced during validation

  • lib/benchmark-common.sh: data-size now runs before stop (clickhouse and pandas need the server up to report size).
  • clickhouse/start: idempotent (was erroring when already running).
  • duckdb/load, sqlite/load: rm -f hits.db/mydb for idempotent reruns.
  • postgresql/load: -v ON_ERROR_STOP=1 so COPY data errors actually fail the script instead of silently rolling back.
  • BENCH_DOWNLOAD_SCRIPT may now be empty for systems that read directly from S3 datalakes / remote services (clickhouse-datalake*, duckdb-datalake*, chyt, …).

Flagged for follow-up review

  • duckdb-memory: the :memory: semantics force a per-query reload; will inflate timings vs. the original single-process flow.
  • cloudberry, greenplum — multi-phase install (reboot between phases); the shim only runs phase 1.
  • sirius — GPU-dependent; long-lived duckdb CLI subprocess proxy; review the stdin/sentinel protocol.
  • paradedb*, pg_ducklake, pg_mooncake — Docker container created in install then docker cp in load (small divergence from the original docker run -v ... due to the lifecycle order: start runs before download).

Test plan

  • bash -n on all 88 systems' scripts
  • clickhouse: full 43-query benchmark.sh on 100M-row real data
  • duckdb: full 43-query benchmark.sh on 100M-row real data
  • pandas: 43-query benchmark.sh on a 5M-row subset
  • sqlite: abbreviated benchmark.sh on a 5M-row subset
  • postgresql: abbreviated benchmark.sh on full 100M-row data
  • Smoke-run on a fresh c6a.metal/equivalent VM via cloud-init for a representative system from each family before merging
  • Verify play.clickhouse.com log-ingestion sink continues to parse the output for at least one production benchmark run

🤖 Generated with Claude Code

alexey-milovidov and others added 3 commits May 7, 2026 12:14
…/data-size

Each local system now exposes a small set of single-purpose scripts with a
stable contract, so they can be driven by a shared lib/benchmark-common.sh
and reused by external tooling (e.g. an online "run query against system X"
service):

  install     env prep + system install (idempotent)
  start       start daemon (idempotent; empty for stateless tools)
  check       trivial query, exit 0 iff responsive
  stop        stop daemon (idempotent)
  load        runs create.sql + loads data, deletes source files, sync
  query       SQL on stdin; result on stdout; runtime in fractional seconds
              on the last line of stderr; non-zero exit on error
  data-size   prints data footprint in bytes (one integer to stdout)

Each system's old monolithic benchmark.sh is replaced by a 4-line shim that
sets a couple of env vars (BENCH_DOWNLOAD_SCRIPT, BENCH_RESTARTABLE) and
exec's lib/benchmark-common.sh. The shared driver runs the unified flow:
install -> start+check -> download -> load (timed) -> for each query
{flush caches; optionally stop+start to neutralize warm-process effects;
run query 3x} -> data-size -> stop. Output format ([t1,t2,t3], Load time,
Data size) matches the previous benchmark.sh exactly so cloud-init.sh.in's
log POST to play.clickhouse.com keeps working unchanged.

For dataframe/in-process systems (pandas, polars-dataframe, chdb-dataframe,
daft-parquet*, duckdb-dataframe, sirius), the engine is wrapped in a small
FastAPI server (server.py) so the start/stop/query interface still applies.
BENCH_RESTARTABLE=no for these (and for embedded CLIs like duckdb, sqlite,
datafusion, etc.) since restarting a single Python/CLI process between
queries would dominate query time.

Scope: 88 local systems refactored. Cloud/managed systems and a handful of
non-functional ones (csvq, dsq, locustdb, mongodb, polars CLI, exasol,
spark-velox) are intentionally left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflict in clickhouse-datalake{,-partitioned}: upstream switched
the datalake variants from filesystem-cache to userspace page-cache (PR #818).
The refactored install/query scripts now adopt the page-cache approach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mongodb: query takes a MongoDB aggregation pipeline (Extended JSON, one
line) on stdin instead of SQL — these are the same canonical 43 ClickBench
queries, just expressed as mongo pipelines. queries.txt is generated from
queries.js (the source of truth) by replacing JS-only constructors
(NumberLong, ISODate, NumberDecimal) with their EJSON canonical form. The
shim sets BENCH_QUERIES_FILE=queries.txt to point the driver at it.

polars: wrapped in a FastAPI server analogous to polars-dataframe, but the
load step uses pl.scan_parquet (LazyFrame) so the parquet file remains
needed at query time — the load script does NOT delete hits.parquet.
data-size returns the on-disk parquet size since a LazyFrame has no
materialized in-memory size.

Both systems now expose the standard install/start/check/stop/load/query/
data-size scripts and a 4-line benchmark.sh shim, removing the old
benchmark.sh / run.js / query.py / formatResult.js paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread clickhouse-datalake-partitioned/load Outdated
…use in query

Per review: clickhouse-local persists table metadata in its --path dir, so
the CREATE TABLE only needs to run once during ./load. ./query just runs
the query against the persisted table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread clickhouse/query Outdated
Comment thread clickhouse/start Outdated
alexey-milovidov and others added 3 commits May 7, 2026 12:29
…atively

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… readiness

Per review (alexey-milovidov): clickhouse start leaves the system in the
desired state (server running) even when it returns non-zero with "already
running". Make the shared driver tolerate non-zero from ./start and rely on
bench_check_loop as the authoritative readiness signal. This lets per-system
start scripts stay simple — they just need to make a best-effort attempt to
launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prmoore77 added a commit to gizmodata/ClickBench that referenced this pull request May 7, 2026
…ouse#860)

Adopts the per-system 7-script interface from ClickHouse#860 for gizmosql/, and
replaces the Java sqlline-based gizmosqlline client with the C++
gizmosql_client shell that ships with gizmosql_server.

Scripts (matching the contract from lib/benchmark-common.sh):
  benchmark.sh - 4-line shim that exec's ../lib/benchmark-common.sh
  install      - apt + curl gizmosql_cli_linux_$ARCH.zip; no openjdk, no
                 separate gizmosqlline download
  start        - idempotent server bring-up (skips if port 31337 is open)
  check        - cheap TCP probe (auth-gated SQL would need credentials)
  stop         - kills tracked PID; pkill belt-and-braces fallback
  load         - rm -f clickbench.db, then create.sql + load.sql via
                 gizmosql_client; deletes hits.parquet and sync's
  query        - reads one query from stdin, runs via gizmosql_client with
                 .timer on + .mode trash; emits fractional seconds as the
                 last stderr line (parsed from "Run Time: X.XXs")
  data-size    - wc -c clickbench.db

Notes:
- BENCH_DOWNLOAD_SCRIPT=download-hits-parquet-single, BENCH_RESTARTABLE=yes
  (gizmosql is a server, so per-query restart neutralizes warm-process
  effects, matching the clickhouse/postgres pattern in ClickHouse#860).
- util.sh now exports GIZMOSQL_HOST/PORT/USER/PASSWORD - the env vars
  gizmosql_client reads natively, so query/load can call gizmosql_client
  with no flags. The server still receives the username via --username.
- PID_FILE moved to a stable /tmp path (was /tmp/gizmosql_server_$$.pid,
  which broke across the start/stop process boundary in the new layout).

This PR depends on ClickHouse#860 (which introduces lib/benchmark-common.sh and the
contract). Once ClickHouse#860 lands, this PR's diff against main will be only
the gizmosql/ files. Validated locally on macOS with gizmosql v1.22.4:
the query script keeps the result on stdout, emits the expected
fractional-seconds value as the last stderr line, and exits non-zero on
error paths.

See https://docs.gizmosql.com/#/client for gizmosql_client docs.
alexey-milovidov and others added 18 commits May 9, 2026 01:22
Resolves merge conflicts:

- Removed cedardb/run.sh, gizmosql/run.sh — superseded by the standard
  query interface; the refactor branch already replaced them.
- Restored datafusion{,-partitioned}/make-json.sh, doris{,-parquet}/get-result-json.sh
  with main's dated-results version. These are independent post-run JSON
  builders, still referenced from the per-system READMEs.
- Kept the thin benchmark.sh shim in gizmosql/, spark-{auron,comet,gluten}/,
  trino/. Per-system result-JSON auto-save (added on main while this branch
  was in flight) is intentionally not carried over: under the new interface,
  result.csv is the single timing artifact and JSON construction belongs in
  separate tooling.
- gizmosql/{install,load,query,util.sh}: merge auto-took main's switch from
  gizmosqlline (Java) to gizmosql_client (CLI shipped with the server),
  but the refactor branch's load/query still referenced GIZMOSQL_SERVER_URI
  and GIZMOSQL_USERNAME. Updated install to drop openjdk + gizmosqlline,
  load to use gizmosql_client (and stop the server first to release the
  database file), and query to drive gizmosql_client with .timer/.mode trash
  and parse "Run Time:" instead of "rows selected (... seconds)".
…-system layout

These four entries were added on main while this branch was in flight (the
existing trino/ scripts here were a memory-connector stub that never worked
end-to-end). Rebuild each one against the new install/start/check/stop/load/
query/data-size contract so they share lib/benchmark-common.sh:

- trino, trino-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/ (matches main's working impl from PR #856).
- trino-datalake{,-partitioned}: same, plus the AnonymousAWSCredentials shim
  to read clickhouse-public-datasets/hits_compatible/athena from anonymous
  S3 (the published bucket size is reported by data-size since the data is
  read on demand). BENCH_DOWNLOAD_SCRIPT="" — no local dataset to fetch.
- benchmark.sh in all four becomes a 4-line shim. Old run.sh deleted.
…r-system layout

These four entries were added on main while this branch was in flight.
Adapt them to the install/start/check/stop/load/query/data-size contract:

- presto, presto-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/.
- presto-datalake{,-partitioned}: same plus the AnonymousAWSCredentials shim
  (compiled in a throwaway trinodb/trino container, since the prestodb image
  ships only a JRE) so the hive-hadoop2 plugin can read the public bucket
  anonymously. BENCH_DOWNLOAD_SCRIPT="" — schema-only load against S3.

Each benchmark.sh becomes a 4-line shim. Old run.sh deleted.
These two entries were added on main while this branch was in flight.
Adapt to the install/start/check/stop/load/query/data-size contract:

- BENCH_DOWNLOAD_SCRIPT="" — the vortex bench binary fetches Parquet and
  converts to .vortex on first invocation.
- BENCH_RESTARTABLE=no — embedded Rust CLI; per-query restart would
  dominate query time.
- query: stages stdin into a temp queries-file and passes -q 0, since the
  bench binary addresses queries by index rather than reading SQL on stdin.
- The single variant uses the `clickbench` binary (vortex 0.34.0); the
  partitioned variant uses `query_bench clickbench` (vortex 0.44.0). Old
  run.sh deleted.
Quickwit was added on main while this branch was in flight. Adapt to the
install/start/check/stop/load/query/data-size contract:

- BENCH_QUERIES_FILE="queries.json" — Quickwit accepts Elasticsearch-format
  JSON queries via the /_elastic compat API, not SQL. queries.json holds one
  ES query per line; queries not expressible in Quickwit are encoded as the
  literal "null".
- BENCH_DOWNLOAD_SCRIPT="" — the load script fetches hits.json.gz directly
  (there is no shared download-hits-json helper) and pipes it through
  `quickwit tool local-ingest`, since v0.9's sharded ingest-v2 endpoint caps
  single-node throughput at a few MB/s.
- BENCH_RESTARTABLE=yes — relies on the common driver's per-query restart
  to flush Quickwit's fast_field_cache and split_footer_cache (the result
  caches are already disabled in node-config.yaml).
- query: returns non-zero for "null" queries so the framework records null
  in the per-query timing array; otherwise reports .took (ms → seconds).
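
The null handling and the ms → seconds conversion described for the Quickwit query script might look roughly like this. It is a sketch only: report_took is a hypothetical helper, and extracting the took value from the JSON response is deliberately left out.

```shell
# Sketch: turn a Quickwit-style "took" value into the timing line.
# took_ms would come from the response's took field (milliseconds).
report_took() {
    local query="$1" took_ms="$2"
    # Queries not expressible in Quickwit are stored as the literal "null";
    # returning non-zero makes the driver record null for that try.
    if [ "$query" = "null" ]; then
        return 1
    fi
    awk -v ms="$took_ms" 'BEGIN { printf "%.3f\n", ms / 1000 }' >&2
}
```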

Old run.sh deleted.
The original used /tmp/gizmosql_server_$$.pid where $$ is the calling
process's PID. That worked when benchmark.sh sourced util.sh and called
start/stop in the same shell, but under the new per-system layout each of
start, stop, load, and query sources util.sh in its own subshell — so
stop_gizmosql couldn't find the PID file written by start_gizmosql. Use a
fixed path under the system directory instead. Also expose wait_for_gizmosql
so callers (like load) can wait for readiness without restarting.
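
The PID-file problem described in this commit message can be illustrated with a small sketch. The file name and the `sleep` placeholder daemon are assumptions for the example; the point is only that the path does not depend on `$$`, so separate start and stop processes agree on it.

```shell
# Sketch: keep the PID file at a fixed path instead of one derived from $$.
# The default location and the `sleep` daemon are placeholders.
PID_FILE="${PID_FILE:-$PWD/example_server.pid}"

start_server() {
    # Idempotent: do nothing if the recorded PID is still alive.
    if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
        return 0
    fi
    sleep 300 >/dev/null 2>&1 &     # placeholder for the real server
    echo $! >"$PID_FILE"
}

stop_server() {
    [ -f "$PID_FILE" ] || return 0  # idempotent: nothing to stop
    kill "$(cat "$PID_FILE")" 2>/dev/null || true
    rm -f "$PID_FILE"
}
```

With a `$$`-based name, each script that sources the helper would compute a different path, which is the failure mode the commit message describes.
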
Conflict only in gizmosql/benchmark.sh — kept the thin shim. Main switched
gizmosql to the official one-line installer (PR #879); fold that into
gizmosql/install so we stop hand-detecting arch and downloading the zip.

Other changes auto-merged: quickwit/index_config.yaml gained tag_fields on
CounterID + record:basic on text fields (PR #886), and assorted result
JSONs for ClickHouse Cloud / Citus / Cratedb / etc.
start/stop scripts may emit progress lines (clickhouse-server prints PID
table tracking, sudo's chown invocation, postgres's startup messages,
etc.). With BENCH_RESTARTABLE=yes those scripts run before every query,
so their output interleaves with the parseable [t1,t2,t3] / Load time /
Data size lines and breaks the cloud-init log POST to play.clickhouse.com.

Redirect both stdout and stderr from ./start and ./stop to /dev/null at
the three call sites in lib/benchmark-common.sh. The check loop is the
authoritative readiness signal, so losing start's output costs nothing
in steady state; for debugging, run ./start manually outside the driver.
The DuckDB installer at install.duckdb.org drops the binary into
~/.duckdb/cli/latest/duckdb and only suggests adding that directory to
PATH. Previously each install attempted a per-user symlink into
~/.local/bin, which silently no-ops when that directory isn't on PATH
(default for root in cloud-init). The result was ./check failing for
300s with no useful error.

Symlink to /usr/local/bin/duckdb via sudo right after install instead;
that's on PATH for every user, and the symlink is itself idempotent.
Ubuntu's docker.io ships the docker CLI without the v2 compose plugin, so
the existing `command -v docker` short-circuit skipped installation on
boxes that already had docker but no `docker compose`. ./start then ran
`docker compose up -d`, which silently failed, and ./check timed out at
300s. Fall back to docker-compose-v2 for the Ubuntu package name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throughput variant of ClickBench. N connections (default 10) hold open
sessions and each picks a uniformly random query from the standard
43-query set; the run goes for a fixed wall-clock window (default 600s)
after a warmup. Reports completed queries, QPS, latency p50/p95/p99,
and per-query mean.

Backends: ClickHouse over HTTP (stdlib http.client), StarRocks over the
MySQL wire protocol (pymysql). Each backend uses its system's recommended
path, so neither pays a wire-format penalty that the other does not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ned}/query: pass query via temp file

`python3 - <<'PY' ... PY` directs the heredoc into python3's stdin so the
interpreter can read its program from there. Once the heredoc is fully
consumed, sys.stdin (the same FD) is at EOF — so sys.stdin.read() inside
the heredoc returned an empty string, and chdb / hyper / sail dutifully
ran the empty query and reported ~0.000s for every try.

Stage stdin into a temp file in bash before invoking the heredoc and pass
the path as argv[1]; the python script reads the query from that file.

Also include result materialization in the timing window for chdb/query
and chdb-parquet-partitioned/query (move `end = ...` past fetchall /
str(res)) — the timer was previously stopped before the result was
realized, which would have under-counted query time even when the stdin
bug wasn't masking it entirely.
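
The stdin fix described in this commit message can be sketched as below. This is an illustration rather than the actual chdb/hyper/sail scripts: the function name is hypothetical and the Python body only echoes the query back instead of running it.

```shell
# Sketch: stage the query from stdin into a temp file *before* starting
# python3 with a heredoc program, then pass the path as argv[1].
run_query_via_python() {
    local qfile rc
    qfile=$(mktemp)
    cat >"$qfile"                   # the script's stdin is the SQL query
    python3 - "$qfile" <<'PY'
import sys
# sys.stdin is the heredoc here, so the query has to come from the file.
with open(sys.argv[1]) as f:
    query = f.read().strip()
print("query was:", query)
PY
    rc=$?
    rm -f "$qfile"
    return $rc
}
```

Calling `sys.stdin.read()` inside the heredoc instead would return an empty string, because the interpreter has already consumed that file descriptor to read its own program.
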
Right now ./check stderr is silently dropped while the loop retries for
300s, then we report "did not succeed within 300s" with no clue why.
For deterministic failures (missing env var like YT_PROXY for chyt, an
install step that didn't run, etc.) the user wastes 5 minutes and still
has to dig through the per-system check script to find out what
happened. Capture the last attempt's stderr and print it on timeout.
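
One way the "capture the last attempt's stderr" behavior could be written is sketched here. CHECK_CMD, the timeout and the interval are hypothetical parameters, and the real bench_check_loop may be structured differently.

```shell
# Sketch of a readiness loop that remembers the last failed attempt's stderr.
# CHECK_CMD stands in for the per-system ./check script.
check_loop() {
    local timeout="${1:-300}" interval="${2:-1}" waited=0 err
    err=$(mktemp)
    while [ "$waited" -lt "$timeout" ]; do
        if "$CHECK_CMD" >/dev/null 2>"$err"; then
            rm -f "$err"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "check did not succeed within ${timeout}s; last stderr was:" >&2
    cat "$err" >&2
    rm -f "$err"
    return 1
}
```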

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The upstream install path assumes RHEL/Rocky/Alma — yum, grubby, SELinux,
the wheel group, /data0. On Ubuntu/Debian the prereqs phase silently
half-completes (several || true skips), the gpadmin user is sometimes
not created, and db-install would later die at `yum install -y go`.
Either way ./check times out at 300s with no diagnostic. Bail with a
clear "needs yum" message before doing anything destructive, and call
out the requirement in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud-init runs scripts as root with HOME unset. Tools that follow
XDG-ish conventions then fall over: the GizmoSQL one-line installer
exits at line 32 with "HOME: parameter not set" (it runs under `sh -u`),
duckdb-vortex's `INSTALL vortex` writes to /.duckdb/extensions/... and
later fails to find it ("Extension /.duckdb/extensions/v1.5.2/..."),
and duckdb-datalake{,-partitioned} queries crash 43 times each with
"Can't find the home directory at ''" while autoloading httpfs.

Each affected install script tried to paper over this locally with
`export HOME=${HOME:=~}`, but the export only lives for that script —
the sibling load/query scripts the lib runs in fresh subprocesses still
see HOME unset. Set it once here so every per-system step inherits it.
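
Setting HOME once in the shared driver might be done along these lines. The helper name, the passwd lookup and the /root fallback are assumptions for illustration, not the PR's code.

```shell
# Sketch: make sure HOME is set and exported before any per-system script
# runs, so child processes (install, load, query, ...) inherit it.
ensure_home() {
    if [ -z "${HOME:-}" ]; then
        # Look up the current user's home directory; fall back to /root.
        HOME=$(getent passwd "$(id -u)" 2>/dev/null | cut -d: -f6)
        HOME=${HOME:-/root}
    fi
    export HOME
}
```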

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apt's monetdb5-sql post-install creates /var/lib/monetdb as the monetdb
user's home dir, so the existing `if [ ! -d /var/lib/monetdb ]` guard
skipped `monetdbd create` and left the dbfarm uninitialized. ./check
then looped 300s on `mclient: cannot connect: control socket does not
exist` and the run died.

Probe the dbfarm marker file (.merovingian_properties) instead of the
directory, and explicitly `monetdbd start` after create — both are
idempotent, and a daemon that's already up just no-ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paradedb/paradedb:0.10.0 (the prior pin) was rotated out of Docker Hub —
docker pull returned "manifest not found" and ./check timed out. The
oldest tags still hosted are 0.15.x, so move both directories onto a
real Postgres-version-specific tag (latest-pg17) that paradedb still
maintains.

This unblocks the image pull. NOTE: paradedb dropped its pg_lakehouse /
parquet_fdw extension after 0.10.x (the parquet_fdw_handler() function
no longer exists), so create.sql still needs to be reworked away from
the foreign-table approach for queries to succeed end-to-end. That's a
separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior URL (qa-build.oss-cn-beijing.aliyuncs.com selectdb-doris-2.1.7-rc01)
returned 404 — SelectDB stopped publishing free standalone tarballs once
the product moved fully to a managed-cloud offering. VeloDB (the company
that now stewards SelectDB) hosts the official Apache Doris release
binaries instead, which are functionally what SelectDB ships today.

Pin to the current stable (4.0.5) and use the symmetric $dir_name path
layout that doris/install already uses, instead of the hardcoded
selectdb-doris-2.1.7 segment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 7 commits May 9, 2026 21:53
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-09 c6a.4xlarge run reports data_size=16.18 GB (the full
load is ~37 GB) and Q21 cold/warm of 17 ms / 1 ms (a real 100M-row
URL-LIKE scan is ~38 s cold / ~80 ms warm). Same shape across Q22,
Q23, Q24, Q26, Q27 — a partial COPY left a tiny surviving subset
that all queries ran fast over. The umbra/load row-count assertion
that just landed will fail this kind of run loudly going forward;
this file predates that check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unrelated all-43-nulls failures observed on c6a.4xlarge runs in the
last 14 hours.

kinetica/query:
  kisql 7.2+ reports timings as "Timing (seconds): Connection=X,
  Query=Y" but the parser still grepped for the legacy "Query
  Execution Time: <s> sec" footer, so every query came back with
  "no Query Execution Time in kisql output" → null. Accept the new
  format (preferred) and keep the old fallback. Also tighten the
  error sniff to anchor "^(error|exception)" so the load step's
  "WARNING: Skipped: 1, inserted ..." doesn't get treated as fatal.

presto/install (and presto-partitioned, presto-datalake,
presto-datalake-partitioned):
  Hardcoded -Xmx48G with query.max-memory=24GB exceeds physical RAM
  on c6a.4xlarge (32 GiB) — JVM tries to grow into swap, earlyoom
  kills it, and queries fail with "java.io.IOException: unexpected
  end of stream on http://localhost:8081/...". Compute the heap
  (~70% of /proc/meminfo MemTotal) and downstream query-memory caps
  from host RAM at install time so the configuration scales to both
  4xlarge and metal/48xl-class machines. Trino is unaffected because
  it never overrode the container default.
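
The "~70% of MemTotal" heap computation could be sketched as follows. The function name is hypothetical, and the meminfo path is a parameter only so the function can be tried on a sample file; how the downstream query-memory caps are derived is not shown.

```shell
# Sketch: derive a JVM heap size (~70% of MemTotal) in whole GiB.
heap_gib_from_meminfo() {
    local meminfo="${1:-/proc/meminfo}"
    # MemTotal is reported in kB; 1 GiB = 1048576 kB. %d truncates.
    awk '/^MemTotal:/ { printf "%d\n", $2 * 0.7 / 1048576 }' "$meminfo"
}
```

The result could then be spliced into the JVM options, for example as `-Xmx$(heap_gib_from_meminfo)G`.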

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
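
For the kinetica/query change in the commit message above, a parser that prefers the new timing line and keeps the legacy fallback might look like this. It is a sketch built only from the two formats quoted in the message; parse_kisql_time is a hypothetical name.

```shell
# Sketch: extract the query time from kisql output, preferring the newer
# "Timing (seconds): Connection=X, Query=Y" line and falling back to the
# legacy "Query Execution Time: <s> sec" footer.
parse_kisql_time() {
    local out="$1" t
    t=$(printf '%s\n' "$out" |
        sed -n 's/.*Timing (seconds):.*Query=\([0-9.][0-9.]*\).*/\1/p' |
        tail -n1)
    if [ -z "$t" ]; then
        t=$(printf '%s\n' "$out" |
            sed -n 's/.*Query Execution Time: *\([0-9.][0-9.]*\) *sec.*/\1/p' |
            tail -n1)
    fi
    [ -n "$t" ] || return 1     # neither format found
    echo "$t"
}
```
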
The shared spark*/query.py uses a few SparkSession.builder config keys
that were dropped/renamed in pyspark 4.0 (notably .config('spark.driver',
'local[*]')). On 4.0 the SparkSession startup fails silently before any
query runs, the script exits without ever printing a numeric timing
line, and the lib records null for every query — observed as 43-null
result rows on the latest c6a.4xlarge run. Match the version used by
the other refactored Spark variants (spark-auron 3.5.5,
spark-comet 3.5.6, spark-gluten 3.5.2, spark-velox 3.5.2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script piped each query to drill-embedded with `printf '%s\r'`,
relying on \r as the line terminator. sqlline (drill-embedded's REPL)
on a non-TTY pipe doesn't treat \r as Enter on Linux — the query sat
buffered, EOF arrived, and sqlline exited without firing the SQL.
Every benchmark run produced 43 null timings.

Switch to writing the query to a tempfile, mount it into the container,
and use `drill-embedded --run=/q.sql` (sqlline's script mode). Also:
- Tolerate drill exiting 0 on a failed query: sniff for Error /
  Aborting command set / No current connection / Java stack traces.
- Tighten the timing regex to "(N rows? in X.YYY seconds?)" so we
  don't misparse other parenthesised numbers in the output.
- Correctly recognise "1 row" (singular) as well as "N rows".
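
The tightened timing pattern from the list above could be expressed roughly as follows. The sample lines in the usage note are constructed from the pattern quoted in the commit message, not copied from real Drill output, and parse_drill_time is a hypothetical name.

```shell
# Sketch: pull the elapsed seconds out of a footer matching
# "(N rows? in X.YYY seconds?)". Reads the tool's output on stdin.
parse_drill_time() {
    local t
    t=$(sed -nE 's/.*\(([0-9]+) rows? in ([0-9]+(\.[0-9]+)?) seconds?\).*/\2/p' |
        tail -n1)
    [ -n "$t" ] || return 1     # no timing footer: treat as failure
    echo "$t"
}
```

Both `(1 row in 0.345 seconds)` and `(25 rows in 12.5 seconds)` match, while other parenthesised numbers such as `(42)` do not.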

Caveat: apache/drill's only published image is linux/amd64. On arm64
hosts the JVM hits a NoClassDefFoundError on RootAllocator init under
QEMU emulation; that's an upstream packaging issue, not a script issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six systems were still on the old monolithic benchmark.sh and were
therefore excluded from the c6a.4xlarge per-system-script-interface
batch. Split each into install / start / stop / check / load / query /
data-size + a thin lib-driven benchmark.sh shim.

  pg_duckdb              TSV ingest + COPY FREEZE; force_execution=true
  pg_duckdb-indexed      parallel COPY shards + indexes via index.sql
  pg_duckdb-parquet      bind-mounts hits.parquet, view-only (no ingest)
  pg_duckdb-motherduck   no local data, CTAS into MotherDuck, REQUIRES
                         MOTHERDUCK_TOKEN; data-size returns the source
                         parquet size so the post-load >5GB sanity
                         check doesn't false-positive on cloud-stored
                         data
  ursa                   ClickHouse-derivative, mirror of clickhouse/
  yugabytedb             yugabyted standalone

Notes:
- pg_duckdb / pg_duckdb-indexed / pg_duckdb-parquet pass postgres
  tuning (shared_buffers, max_*_workers, duckdb.max_memory, etc.) via
  `postgres -c k=v` at docker-run time so the cluster picks them up
  without a second restart, replacing the old "append to
  postgresql.conf then docker restart" dance.
- pg_duckdb-parquet downloads hits.parquet in install (not via lib's
  bench_download) because the container needs the file bind-mounted at
  start time, before lib's download phase runs.
- All six set BENCH_RESTARTABLE=yes so the cold/warm/warm methodology
  applies (the lib's stop -> wait -> drop_caches -> start sequence is
  what the postgres-style cache invalidation needs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ILE to cpimport

Two coupled fixes:

1. Refactor to per-system-script-interface (install/start/stop/check/load
   /query/data-size + lib-driven benchmark.sh). The entry was on the old
   monolithic format, so it was excluded from the c6a.4xlarge batch.

2. Switch the data load from `LOAD DATA LOCAL INFILE` to `cpimport`.
   ColumnStore's recommended bulk path is cpimport — the SQL-layer
   LOAD DATA INFILE the entry used could not handle the 75 GB hits.tsv
   and died after ~5 min with the cryptic
       ERROR 1030 (HY000): Got error -1 "Internal error < 0
       (Not system error)" from storage engine ColumnStore
   that the README documented as "we couldn't reproduce, MariaDB has
   no public issue tracker." cpimport reads STDIN, so we can pipe the
   host-side file straight in without docker cp.

Also:
- New mariadb client requires SSL by default; the columnstore image's
  server doesn't support SSL. Pass --skip-ssl everywhere.
- Container-side server is provisioned + a per-user GRANT issued in
  ./start; idempotent on subsequent restarts.
- ./query parses both "(X.YYY sec)" and "(M min S sec)" forms and
  correctly converts to fractional seconds.
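
Handling both timing forms mentioned in the last bullet might look like the sketch below. parse_mariadb_time is a hypothetical helper; it reads the client output on stdin and fails when no timing footer is present.

```shell
# Sketch: convert a mariadb client timing footer to fractional seconds.
# Handles "(X.YYY sec)" and "(M min S sec)".
parse_mariadb_time() {
    sed -nE \
        -e 's/.*\(([0-9]+) min ([0-9]+(\.[0-9]+)?) sec\).*/\1 \2/p' \
        -e 's/.*\(([0-9]+(\.[0-9]+)?) sec\).*/0 \1/p' |
    tail -n1 |
    awk 'NF == 2 { printf "%.3f\n", $1 * 60 + $2; found = 1 }
         END { exit !found }'
}
```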

Tested locally on Ubuntu 26.04 / arm64 (mariadb/columnstore image is
multi-arch):
  - install + start + provision + GRANT all idempotent.
  - cpimport of 100k rows: 1000 rows/s sustained, no errors.
  - Q1/Q2/Q5/Q21 against the 100k subset return correct counts with
    sane timings (28 ms / 1.027 s / 31 ms / 96 ms).
  - stop -> start -> check round-trip works; data persists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 22 commits May 10, 2026 00:10
The previous embedded-Python-per-query design re-loaded the entire
hits.parquet into a fresh DuckDB :memory: connection on every ./query
invocation. Every "query" measurement was therefore really a full
ingest (minutes on the full dataset) that dwarfed the SQL execution
time and produced wildly inflated numbers. It is also why the entry
was producing 0 rows in recent c6a.4xlarge runs: each query took
longer than the lib's per-query slot.

Mirror the duckdb-dataframe / pandas / polars-dataframe layout instead:
a uvicorn/FastAPI server (server.py) holds one DuckDB connection with
the compressed_mem :memory: schema loaded once via /load. start/stop
manage the python pid; check is GET /health; query is POST /query
with the SQL on the request body. data-size returns the server
process RSS as a proxy for the in-memory compressed footprint.

BENCH_RESTARTABLE was already "no" (the lib doesn't restart the
server between queries), which is exactly what we want — restarting
would dump the in-memory compressed state and force a full re-ingest
for every query, which is the bug we're fixing.

Tested locally on a 1M-row hits_0.parquet sample:
  load    1.547 s
  Q1      0.002 s   (count = 1000000)
  Q21     0.011 s   (URL LIKE)
  RSS     1.17 GB

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Of the 8 entries that load the dataset into memory and serve queries
from a long-lived Python process (pandas, polars-dataframe,
duckdb-dataframe, duckdb-memory, chdb-dataframe, daft-parquet,
daft-parquet-partitioned, sirius), only 5 carried the "in-memory" tag
in their templates. Backfill the other 3 (daft-parquet,
daft-parquet-partitioned, sirius) and add the tag to every historical
result that doesn't already have it so the dashboard's tag-based
filtering stays consistent.

67 files updated. Diff is minimal (one trailing-comma change + one new
tag per file) — patched in place rather than re-pretty-printing JSON,
so the existing single-line / multi-line / indented styles in the
result files are preserved.
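
One way to patch a tag in place without re-serializing the file is
sketched below. It assumes each file has a single "tags" array of plain
strings; `add_tag` is a hypothetical helper, not the script that
produced this commit:

```python
import json
import re

# Assumption: one "tags" array per file, and no tag contains "]".
TAGS_RE = re.compile(r'("tags"\s*:\s*\[)(.*?)(\s*\])', re.DOTALL)


def add_tag(text: str, tag: str) -> str:
    """Append tag to the "tags" array, leaving all other bytes untouched."""
    match = TAGS_RE.search(text)
    if match is None:
        return text
    body = match.group(2)
    existing = json.loads("[" + body + "]")
    if tag in existing:
        return text  # already tagged, nothing to do
    if not existing:
        new_body = json.dumps(tag)
    else:
        # Multi-line arrays: reuse the indentation of the last element.
        indent = re.search(r"\n([ \t]*)\S[^\n]*$", body)
        separator = ",\n" + indent.group(1) if indent else ", "
        new_body = body + separator + json.dumps(tag)
    return text[: match.start(2)] + new_body + text[match.end(2):]
```

Only the array body is rewritten, so single-line and indented files
keep their original layout.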

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckHouse/ClickBench into refactor/per-system-script-interface
…alse

Investigating ClickHouse's ~2 s cold-Q40 floor on c6a.4xlarge: the
default async_load_databases=1 makes the server bind its listen port
and answer SELECT 1 before user-database parts have finished loading.
The lib's bench_check_loop then sees ./check pass, drop_caches+restart
looks "ready", and the first real query — Q40, the heaviest in the
suite — stalls 2-3 s waiting for the part loader to finish.

Measured locally on this 96-core arm box (NVMe):

  async_load_databases=1 (default):
    SELECT 1 ready at:                       0.20 s
    First SELECT count() FROM hits:          2.89 s   <-- waiting on parts
    Q40 (parts now loaded):                  0.33 s

  async_load_databases=0 (this commit):
    SELECT 1 ready at:                       0.12 s
    First Q40 cold (parts already loaded):   0.25 s
    Q40 warm:                                0.085 s

A ~12x cold-run improvement on Q40 (and similar on every other "cold"
measurement, since the wait is pre-paid into the bench_start step
where it belongs instead of the first query's timer). Drop a
config.d/async_load_databases.xml override in both clickhouse/install
and clickhouse-tencent/install (the two refactored entries that
install via `clickhouse install --noninteractive`).

Other CH-family entries don't need this: clickhouse-{datalake,parquet}*
use `clickhouse local` (embedded, no daemon), clickhouse-{cloud,web}
are managed services we can't reconfigure, byconity uses its own
docker-compose stack, ursa is a separate binary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same setting (async_load_databases=false), just stored as
config.d/async_load_databases.yaml instead of an XML snippet — matches
the YAML config style preferred elsewhere in the project. Verified the
clickhouse-server picks it up (system.server_settings shows the value
applied) and the ~12x cold-Q40 improvement is intact (0.25 s here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ewarm

Follow-up to {clickhouse,clickhouse-tencent}/install: forcing
async_load_databases=false ensures parts are loaded by the time the
server reports ready, but marks and primary indices were still loaded
lazily on first column access. So the FIRST cold query after a fresh
restart paid the mark/PK-load cost; subsequent queries against the
same columns were fast.

Adding the per-table prewarm settings — prewarm_mark_cache,
prewarm_primary_key_cache, min_bytes_to_prewarm_caches — instructs
the engine to populate those caches during startup, again moving the
work out of the cold-query timer into bench_start where it belongs.

Local measurements stacked over the async_load_databases fix:

  default (async_load=1, prewarm=0):    cold=2.89s   warm=0.085s  (34x)
  async_load=0, prewarm=0:              cold=0.25s   warm=0.085s  (3x)
  async_load=0, prewarm=1 (this commit): cold=0.19s  warm=0.085s  (2.2x)

The remaining ~0.10 s cold/warm gap is OS pagecache misses for the
actual column data, which can't be eliminated without keeping data
resident across the restart (which would defeat the cold-restart
methodology).

Skipped ursa (older fork; settings may not exist) and the managed
clickhouse-{cloud,web} entries (can't change settings server-side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cache prewarm"

Prewarming the mark and primary-key caches at startup is a CH-specific
optimization that other systems in the suite don't get an equivalent
of, so applying it here gives ClickHouse an unfair advantage on the
"cold" measurement vs. systems that genuinely do load metadata on
first query. Keep async_load_databases=false (correctness — that just
ensures ./check doesn't pass before parts exist) but drop the prewarm.

Cold Q40 goes from 0.19 s back to ~0.25 s — still a ~12x improvement
over the original 2.89 s, all from the async_load_databases fix alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "serverless" tag should only mark actual cloud services (where the
user pays per-query and the vendor manages the runtime). It was
mistakenly applied to embedded/in-memory engines that run entirely on
the benchmark machine: chdb, chdb-dataframe, chdb-parquet-partitioned,
clickhouse-web (browser WASM), daft-parquet, daft-parquet-partitioned,
glaredb, glaredb-partitioned, opteryx, pandas.

Cloud services keep the tag: bigquery, motherduck, pg_duckdb-motherduck.
Applied across templates and historical result files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckHouse/ClickBench into refactor/per-system-script-interface
async_load_databases:false alone wasn't enough — even after the user
database loads synchronously, MergeTree still loads the in-memory
primary key and per-column .size streams lazily on the first query.
With ~25 parts × ~80 columns that's >1.8k file opens on the cold path,
adding several hundred ms beyond the disk I/O the benchmark is meant
to measure.

Toggling primary_key_lazy_load=0 and
columns_and_secondary_indices_sizes_lazy_calculation=0 shifts that
work into ./start (which isn't measured). Q40 cold ~3.0s → ~0.9s on
c6a.metal; same I/O happens either way, just before the timer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ypes

The default auto_statistics_types = 'minmax, uniq' silently materializes
per-column statistics on every merge — even when CREATE TABLE has no
STATISTICS clause. The uniqState blobs grow with column cardinality
(~140 KB for UserID/URL/etc.); summed across the wide hits schema each
part's statistics.packed lands at ~4 MB.

On a cold restart the per-query Loading statistics path reads all those
files synchronously before the planner can prune parts (cached_estimator
is populated by an async bg task that hasn't run yet by the time
SELECT 1 succeeds). With ~25 parts post-load that's 100 MB of cold reads
on the critical path of every first query.

Setting auto_statistics_types = '' stops the files from being written
on subsequent merges. Verified: after OPTIMIZE FINAL the merged part
has no statistics.packed, and the Loading statistics log line and its
0.4–0.9 s of work disappear from the cold-query trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PySpark's console progress bar writes lines like
    [Stage 1:>... (0 + 96) / 111]\r\r   spaces   \r
i.e. uses carriage returns to redraw in place, no trailing \n. The
timing print from query.py ends up appended to that same logical line,
and the strict ^[0-9]+(\.[0-9]+)?$ regex never matches — so even
successful runs were recorded as null. That accounts for the
"Q0..Q37 all null, Q38+ succeed" pattern across all spark variants
(Q38+ are filtered queries that finish before stage progress is
reported, so their stderr happens to end cleanly).

Translating \r to \n before grep splits each progress-bar update
into its own line and the timing line stands alone. Verified by
re-running spark Q1, Q5, Q10, Q19, Q20, Q30, Q38, Q40 locally:
old parser returns NULL for 7/8, new parser returns the real
timing for 8/8. Q20 (1.85s) was the one that already worked
because it finished before the bar appeared.
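
The effect of the \r translation can be reproduced with a short
sketch. It models the strict per-line regex in Python rather than
grep, and `last_timing` is an illustrative name, not the parser in
the spark entries:

```python
import re
from typing import Optional

# Same shape as the strict timing regex: the whole line must be a number.
TIMING_LINE = re.compile(r"[0-9]+(\.[0-9]+)?")


def last_timing(raw: str, translate_cr: bool = True) -> Optional[float]:
    """Return the last line that is purely a number, or None."""
    if translate_cr:
        # Split every in-place progress-bar redraw into its own line.
        raw = raw.replace("\r", "\n")
    timings = [line for line in raw.split("\n") if TIMING_LINE.fullmatch(line)]
    return float(timings[-1]) if timings else None


# Progress bar redrawn with carriage returns, timing appended to the same line.
SAMPLE = "[Stage 1:>   (0 + 96) / 111]\r\r          \r1.85\n"
```

With `translate_cr=False` the sample yields None, which mirrors the
null results; with the translation it yields the timing.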

Other "high null rate" systems (presto, trino, starrocks,
octosql, databend, siglens) use independent timing channels —
shell `date`, curl `-w`, custom regex on CLI output — so this
issue is Spark-specific. Their nulls are real query failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that together close the gap between this server and the
duckdb-memory one (which never had these problems).

1. /query now executes whatever SQL the request body contains instead
   of looking it up in a hardcoded 43-entry table. The whitelist was
   a relic of the pre-server.py refactor — it bought nothing (the
   strings are identical to queries.sql) and turned every queries.sql
   tweak into a silent 404. Also removes the dead _make_runner factory
   that was never called.

2. install pulls in pytz. DuckDB needs it whenever it has to
   materialise a tz-aware pandas timestamp back to Python; with the
   official Athena parquet (EventTime: int64) it doesn't bite, but
   any future schema change that hands DuckDB a datetime64[ms, UTC]
   would 500 on Q24 (SELECT * ... ORDER BY EventTime). Cheap to
   install, expensive to debug after the fact.

Verified on a 100k-row slice of the official parquet: all 43 queries
return non-null timings via /query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ColumnStore (MCS-1001) doesn't support regexp_replace in the storage
engine, so the original Q29 fails with all three attempts erroring
out and the bench records [null, null, null]. The InnoDB-backed
mariadb/queries.sql has the function and works fine; this is purely
an MCS engine limitation.

Equivalent rewrite using SUBSTRING_INDEX + REPLACE: strip the optional
"www." prefix, take everything after "://" (or the whole string if no
protocol), then take everything before the first "/". For all URLs
with a protocol and trailing slash — which is essentially every row
in the hits dataset — this produces the same key as the regex.

Edge cases that diverge (URLs without trailing slash like
"http://foo.bar", or referers with no protocol like "mail.ru/foo")
get bucketed by host instead of being passed through unchanged.
That's arguably more correct for the query's intent (group by host),
and the affected rows are a tiny minority — not worth a CASE WHEN
just to mirror REGEXP_REPLACE's pass-through behaviour on
malformed referers.
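
The agreement and the divergence described above can be checked with a
small Python model. The regex is assumed to be the one used by the
original Q29, and `split_key` only mimics the split-based steps; it is
not the SQL in queries.sql:

```python
import re

# Assumed Q29 pattern: capture the host of URLs with a protocol and a path.
Q29_PATTERN = re.compile(r"^https?://(?:www\.)?([^/]+)/.*$")


def regex_key(referer: str) -> str:
    """REGEXP_REPLACE behaviour: non-matching input passes through unchanged."""
    return Q29_PATTERN.sub(r"\1", referer)


def split_key(referer: str) -> str:
    """Split-based model: drop the protocol, cut at the first "/", strip "www."."""
    rest = referer.split("://", 1)[1] if "://" in referer else referer
    host = rest.split("/", 1)[0]
    return host[4:] if host.startswith("www.") else host
```

For "https://www.example.com/a/b" both functions return "example.com".
For "http://foo.bar" and "mail.ru/foo" the regex passes the input
through while the split-based key is the host.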

Verified: query runs cleanly on the MCS test instance and the rewrite
matches REGEXP_REPLACE on a 6-row sample of typical referers from
the dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stock SQLite (3.46 in this run, but the same on every Ubuntu version
ClickBench targets) does not ship REGEXP_REPLACE — it requires a
build-time --enable-regexp flag or a loaded extension. So Q29 has been
returning [null,null,null] on every SQLite run with "Parse error: no
such function: REGEXP_REPLACE" in the log.

Same shape of fix as the recent mariadb-columnstore rewrite: pull the
host out with INSTR + SUBSTR + IIF instead of a regex. SQLite has no
SUBSTRING_INDEX so each "split" needs its own subquery layer, but the
algebra is identical:

  level 1: take everything after "://"  (or whole string if no protocol)
  level 2: take everything before first "/"  (or whole string if no path)
  level 3: strip leading "www."
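
The three levels can be tried against Python's bundled SQLite. This is
a sketch, not the query in sqlite/queries.sql: it uses CASE WHEN, which
IIF is shorthand for, so that it also runs on older SQLite builds, and
`host_key` is a hypothetical helper:

```python
import sqlite3

# One subquery layer per "split", innermost first.
HOST_KEY_SQL = """
SELECT CASE WHEN SUBSTR(host, 1, 4) = 'www.'
            THEN SUBSTR(host, 5) ELSE host END            -- level 3
FROM (
    SELECT CASE WHEN INSTR(rest, '/') > 0
                THEN SUBSTR(rest, 1, INSTR(rest, '/') - 1)
                ELSE rest END AS host                      -- level 2
    FROM (
        SELECT CASE WHEN INSTR(Referer, '://') > 0
                    THEN SUBSTR(Referer, INSTR(Referer, '://') + 3)
                    ELSE Referer END AS rest               -- level 1
        FROM (SELECT ? AS Referer)
    )
)
"""


def host_key(referer: str) -> str:
    """Run the nested rewrite on one referer in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(HOST_KEY_SQL, (referer,)).fetchone()[0]
    finally:
        conn.close()
```

Calling `host_key` on a URL with a protocol and a path returns the bare
host; inputs without a protocol or without a path are bucketed by host.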

Verified on a synthetic table with the typical referer shapes — the
rewrite produces the same hostname keys as REGEXP_REPLACE for every
URL with a protocol-and-path. URLs without a trailing "/" or without
a protocol get bucketed by host instead of pass-through, which is
consistent with mariadb-columnstore and arguably more correct for
the query's intent (group by host).

Other systems that fail Q29 for the same root cause (REGEXP_REPLACE
unsupported) but couldn't be tested locally:

  * cratedb       — uses '$1' (PostgreSQL-style); current result also
                    null, error not yet inspected — may need rewrite
  * elasticsearch — ES SQL surface is intentionally limited; needs an
                    ES-specific rewrite (no SUBSTRING_INDEX, no IIF)

Historical-only systems left as-is (results from 2020–2022, not in
active CI): aurora-mysql, druid, heavyai, infobright, monetdb, pinot,
singlestore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both installs hard-coded x86_64 release URLs, so every c8g.* run
downloaded an amd64 binary that immediately died on aarch64:

* databend — exec'd briefly, then ./check looped 600 s on
  "Failed to connect to localhost:8124"; produced 0 results.
* octosql — failed at first invocation with "cannot execute binary
  file: Exec format error" before any query ran.

Both projects ship matching arm64 nightlies/releases. Resolve `uname -m`
in install and pick the corresponding tarball. Also point databend at
the renamed databendlabs/databend org (datafuselabs/* still 301s but
no reason to keep the old path).
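
A sketch of the architecture mapping, written in Python for
illustration (the install scripts themselves are shell). The alias
table and `release_arch` are assumptions; the real artifact names are
whatever each project publishes:

```python
# Assumption: artifacts are published under these two architecture names.
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "aarch64",
    "arm64": "aarch64",
}


def release_arch(machine: str) -> str:
    """Map a `uname -m` style string to the architecture used in artifact names."""
    try:
        return ARCH_ALIASES[machine.lower()]
    except KeyError:
        # Fail early with a clear message instead of downloading a wrong binary.
        raise ValueError(f"unsupported architecture: {machine}") from None
```

`platform.machine()` returns the same string as `uname -m` on Linux,
so `release_arch(platform.machine())` would select the artifact.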

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recent c8g.* sweep surfaced systems that download an x86_64
binary or pull an amd64-only Docker image and then fail late with
opaque errors after a 14 GB hits.parquet download or a 600 s
./check timeout. Where upstream simply doesn't publish an arm64
artifact, abort install with a clear message instead of letting
the run burn an EC2 instance to discover the mismatch.

Affected:

* hyper, hyper-parquet — tableauhyperapi has Linux x86_64
  manylinux2014 wheels only (no aarch64).
* doris, doris-parquet — Apache Doris release tarballs are
  apache-doris-*-bin-x64.tar.gz; no arm64 mirror exists.
* citus — citusdata/citus is amd64-only on every tag (Docker Hub
  manifest), runs under QEMU on arm64 and never starts in time.
* pgpro_tam — innerlife/pgpro_tam is amd64-only.

opteryx is the one fixable case in this batch: 0.26.1 only ships
x86_64 wheels and the sdist build was breaking on arm64
("third_party/abseil/containers.pyx doesn't match any files");
0.26.8 publishes manylinux2014_aarch64 wheels for cp310-cp313, so
bumping the pinned version unblocks c8g.* without affecting x86_64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pgducklake container ships with duckdb.memory_limit defaulted to
4 GB, so the CTAS over the 14 GB hits.parquet failed mid-load with

  Out of Memory Error: failed to allocate data of size 256 KiB
  (4.0 GiB/4.0 GiB used)

even on c8g.metal-48xl (384 GB RAM). On the c7a.metal-48xl run the
same OOM surfaced one query later as "Current transaction is aborted
(please ROLLBACK)" because the failed CTAS poisoned the session.

Compute 80 % of host MemTotal and SET the memory_limit on the same
session that runs create.sql (psql with -c followed by -f resets the
session between commands, so pipe both through stdin instead).
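
A sketch of the 80 % computation and of prepending the SET to the
create script so both reach the server on one session. The helper
names, the MB unit, and the exact SET statement are assumptions, not
the code in ./load:

```python
import re


def memory_limit_mb(meminfo_text: str, percent: int = 80) -> int:
    """Return percent of MemTotal, in MB, from /proc/meminfo contents."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    total_bytes = int(match.group(1)) * 1024  # /proc/meminfo reports KiB
    return total_bytes * percent // 100 // 1_000_000


def load_script(meminfo_text: str, create_sql: str) -> str:
    """Build one script: the SET first, then the contents of create.sql."""
    limit = memory_limit_mb(meminfo_text)
    return f"SET duckdb.memory_limit = '{limit}MB';\n{create_sql}"
```

The resulting text would be piped to a single psql invocation on
stdin, so the SET and the CTAS share a session.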

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>